Exploration of Projection Spaces¶
# Feel free to add dependencies, but make sure that they are included in environment.yml
# disable some noisy warnings
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
# render figures inline instead of in a separate window
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import altair as alt
from altair import datum
alt.data_transformers.disable_max_rows()
from sklearn import manifold
from openTSNE import TSNE as OpenTSNE  # aliased so it does not shadow sklearn's TSNE below
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import ParameterGrid
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
import plotly.express as px
# openpyxl is required to read Excel files; it is also listed in environment.yml
# pip install openpyxl
Data¶
To be able to explore paths in a projected space, you need to pick a problem/algorithm/model that consists of multiple states that change iteratively.
Click to see an Example
An example is solving a Rubik's Cube. After each rotation, the state of the cube changes. This results in a path from the initial state, through the individual rotations, to the solved cube. By projecting these states into two-dimensional space, we can examine the individual states and paths. Depending on the initial state and the solution strategy, the paths will differ or resemble each other.
This is an example of solving 10 randomly scrambled Rubik's Cubes with two different strategies, the Beginner (in green) and the Fridrich Method (in orange):
Read and Prepare Data¶
Read in your data from a file or create your own data.
Document any data processing steps.
# TODO
import os
import re
import pandas as pd
# Define board coordinates
board_size = 19
coordinates = [chr(97 + row) + chr(97 + col) for row in range(board_size) for col in range(board_size)]
# Define columns for the final DataFrame
columns = ["game_id", "move_id", "color", "winner_color", "winner_score", "result", "rules", "handicap", "starter_player", "step_count"] + coordinates
# Initialize an empty DataFrame
all_games_df = pd.DataFrame(columns=columns)
# Helper functions
def parse_metadata_and_moves(sgf_content):
moves = []
# Determine rules and handicap
rules = "Unknown"
handicap = 0
if "RU[" in sgf_content:
rules = sgf_content.split("RU[")[1].split("]")[0]
if "HA[" in sgf_content:
handicap = int(sgf_content.split("HA[")[1].split("]")[0])
# Determine winner information
winner_color = "black" if "RE[B+" in sgf_content else "white"
result = "resign" if "Resign" in sgf_content else "score"
winner_score = None
if result == "score":
winner_score = sgf_content.split("RE[")[1].split("]")[0][2:]
# Extract moves
sgf_moves = sgf_content.split(";")[2:] # Skip the first two header parts
move_id = 1
for move in sgf_moves:
color = "black" if move.startswith("B") else "white"
pos = move[2:4]
moves.append((move_id, color, pos))
move_id += 1
# Determine starter player based on the first move
starter_player = moves[0][1] if moves else "unknown" # "unknown" if there are no moves
return moves, winner_color, winner_score, result, rules, handicap, starter_player
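As a quick sanity check, the split-based parsing above can be traced on a hypothetical one-line SGF fragment (the tag values below are made up, not taken from the project's data):

```python
# Hypothetical minimal SGF string showing how the string operations
# in parse_metadata_and_moves behave.
sgf = "(;GM[1]RU[Japanese]HA[2]RE[B+5.5];B[pd];W[dp];B[qq])"

rules = sgf.split("RU[")[1].split("]")[0]            # 'Japanese'
handicap = int(sgf.split("HA[")[1].split("]")[0])    # 2
winner_color = "black" if "RE[B+" in sgf else "white"
winner_score = sgf.split("RE[")[1].split("]")[0][2:] # strip the 'B+' prefix

moves = []
for move_id, move in enumerate(sgf.split(";")[2:], start=1):
    color = "black" if move.startswith("B") else "white"
    moves.append((move_id, color, move[2:4]))  # positions like 'pd'
```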
# Initialize board with all 0s
def initialize_board():
return {coord: 0 for coord in coordinates}
# Apply moves and log each step for a single game
def process_game(game_id, moves, winner_color, winner_score, result, rules, handicap, starter_player):
board_state = initialize_board()
data = []
step_count = len(moves)
for move_id, color, pos in moves:
# Update board with current move
board_state[pos] = 2 if color == "black" else 1 # 2 for Black, 1 for White
row_data = {
"game_id": game_id,
"move_id": move_id,
"color": color,
"winner_color": winner_color,
"winner_score": winner_score,
"result": result,
"rules": rules,
"handicap": handicap,
"starter_player": starter_player,
"step_count": step_count
}
row_data.update(board_state)
data.append(row_data.copy())
# Reset the stone to 0 for the next move
board_state[pos] = 0
return data
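Note that the reset at the end of the loop means each output row encodes only the single stone placed at that move, not the cumulative board. A minimal sketch on a toy 3x3 board:

```python
# Toy 3x3 board illustrating the per-move encoding used in process_game:
# each row records only the stone just played (2 = black, 1 = white),
# because the played position is reset to 0 after every move.
coords = [chr(97 + r) + chr(97 + c) for r in range(3) for c in range(3)]
board = {c: 0 for c in coords}

rows = []
for move_id, color, pos in [(1, "black", "aa"), (2, "white", "bb")]:
    board[pos] = 2 if color == "black" else 1
    rows.append(board.copy())
    board[pos] = 0  # reset so the next row only shows its own move
```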
# Parse and process multiple SGF files in a folder
def process_sgf_folder(folder_path):
global all_games_df
game_files = []
# Gather files with their game IDs
for filename in os.listdir(folder_path):
if filename.endswith(".sgf"):
# Extract the game ID from the filename using regex
match = re.match(r"(\d+)_", filename)
if match:
game_id = int(match.group(1))
game_files.append((game_id, filename))
# Sort files by game_id
game_files.sort(key=lambda x: x[0])
# Process each file in sorted order
for game_id, filename in game_files:
with open(os.path.join(folder_path, filename), "r", encoding="utf-8") as file:
sgf_content = file.read()
moves, winner_color, winner_score, result, rules, handicap, starter_player = parse_metadata_and_moves(sgf_content)
game_data = process_game(game_id, moves, winner_color, winner_score, result, rules, handicap, starter_player)
game_df = pd.DataFrame(game_data, columns=columns)
all_games_df = pd.concat([all_games_df, game_df], ignore_index=True)
# Specify the folder path and the player's color
folder_path = "final/SGF"  # Update with the path to your folder (forward slashes avoid invalid escape sequences)
# Process the SGF files
# process_sgf_folder(folder_path)  # already run once; we don't have to run it again
# Save the merged DataFrame to an Excel file
output_path = "merged_games_data_sorted_final_new.xlsx"
#all_games_df.to_excel(output_path, index=False)
#print(f"Data saved to {output_path}")
"""
# Load the score table
score_table_path = "final/Scores_key.xlsx" # Update with your score table path
score_df = pd.read_excel(score_table_path)
# Rename 'ID' column in score table to 'game_id' to match main DataFrame
score_df = score_df.rename(columns={"Id": "game_id"})
# Merge the score table with all_games_df on 'game_id'
merged_df = pd.merge(all_games_df, score_df, on="game_id", how="left")
# Save the merged DataFrame to an Excel file
output_path = "merged_games_with_scores_final_new.xlsx"
merged_df.to_excel(output_path, index=False)
"""
Comments¶
- Did you transform, clean, or extend the data? How/Why?
We downloaded Go games as .sgf files, read them in from a folder, extracted metadata from the raw text with string functions, and extracted the board state at every move of each game. Example metadata features: player colors, rules, handicap, time, result type, etc.
We also used a helper table (GOgame\Scores_key.xlsx) with the area and territory scores, which we obtained by visualizing the .sgf files on this webpage: https://speedtesting.herokuapp.com/sgfviewer/#google_vignette. From the scores we also derived categories, for example whether a player is at beginner / intermediate / master level.
Thresholds:
- 40 - 80: Beginner
- 80 - 200: Intermediate
- above 200: Master
Then we merged these two tables on game_id.
Final table: GOgame\merged_games_with_scores_final.xlsx
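The level thresholds above can be expressed as a binning step. A minimal sketch with `pandas.cut`; the column and label names here are illustrative, not necessarily those used in Scores_key.xlsx:

```python
import pandas as pd

# Bin scores into the level categories described above.
# Bins are right-inclusive: (40, 80] -> Beginner, (80, 200] -> Intermediate, (200, inf) -> Master.
scores = pd.Series([55, 120, 250])
levels = pd.cut(scores, bins=[40, 80, 200, float("inf")],
                labels=["Beginner", "Intermediate", "Master"])
```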
Projection¶
Project your data into a 2D space. Try multiple (3+) projection methods (e.g., t-SNE, UMAP, MDS, PCA, ICA, other methods) with different settings and compare them.
Make sure that all additional dependencies are included when submitting.
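Before working with the Go data below, the general comparison pattern can be sketched on stand-in data: fit several projection methods on the same matrix and collect the 2D embeddings. The data here is random and only illustrates the API shape, not the project's feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, MDS

# Random stand-in data; replace with the preprocessed feature matrix.
X = np.random.default_rng(0).normal(size=(60, 10))

# Each method maps the same rows to 2D, so the embeddings can be compared side by side.
projections = {
    "PCA": PCA(n_components=2).fit_transform(X),
    "t-SNE": TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X),
    "MDS": MDS(n_components=2, random_state=0).fit_transform(X),
}
```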
# Load dataset
data = pd.read_excel('GOgame/merged_games_with_scores_final_new.xlsx')  # relative path to the data folder
# Select the board state columns for dimensionality reduction
board_state_columns = data.loc[:, 'aa':'ss']
# Everything else is per-game/per-move metadata
info_cols = [col for col in data.columns if col not in board_state_columns]
# player_info_columns for merged_games_with_scores_final.xlsx
player_info_columns = data[info_cols].copy()  # .copy() avoids SettingWithCopyWarning later
Here you can see the metadata features that we collected.
player_info_columns.head()
| game_id | move_id | color | winner_color | winner_score | result | rules | handicap | starter_player | step_count | ... | Date | Our_players_colour | Area score | Area score of opponent | Area_winner_color | Area_result | Territory_score | Territory_score_of_opponent | Territory_winner_color | Territory_result | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | white | white | Time | score | Japanese | 4 | white | 105 | ... | September, 2024 | W | 54.5 | 56.0 | B | 1,5 | 4.5 | 3.0 | W | 1,5 |
| 1 | 1 | 2 | black | white | Time | score | Japanese | 4 | white | 105 | ... | September, 2024 | W | 54.5 | 56.0 | B | 1,5 | 4.5 | 3.0 | W | 1,5 |
| 2 | 1 | 3 | white | white | Time | score | Japanese | 4 | white | 105 | ... | September, 2024 | W | 54.5 | 56.0 | B | 1,5 | 4.5 | 3.0 | W | 1,5 |
| 3 | 1 | 4 | black | white | Time | score | Japanese | 4 | white | 105 | ... | September, 2024 | W | 54.5 | 56.0 | B | 1,5 | 4.5 | 3.0 | W | 1,5 |
| 4 | 1 | 5 | white | white | Time | score | Japanese | 4 | white | 105 | ... | September, 2024 | W | 54.5 | 56.0 | B | 1,5 | 4.5 | 3.0 | W | 1,5 |
5 rows × 23 columns
The board states at each move, the position of the stone placed on the board:
board_state_columns
| aa | ab | ac | ad | ae | af | ag | ah | ai | aj | ... | sj | sk | sl | sm | sn | so | sp | sq | sr | ss | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9750 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9751 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9752 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9753 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9754 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
9755 rows × 361 columns
PCA¶
# Standardize the board state data
scaler = StandardScaler()
board_state_scaled = scaler.fit_transform(board_state_columns)
# Apply PCA to the board state columns
pca = PCA()
board_state_pca = pca.fit_transform(board_state_scaled)
# Plot the explained variance ratio
plt.figure(figsize=(10, 6))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by Principal Components (Board State)')
plt.grid(True)
plt.show()
# Checking how many components explain a significant amount of variance
explained_variance = pca.explained_variance_ratio_
# Summary of PCA
print(f"Explained variance by each component: {explained_variance}")
print(f"Total explained variance: {np.cumsum(explained_variance)}")
Explained variance by each component: [0.00278106 0.00278098 0.00278027 ... 0.00277187 0.00277186 0.00277138 0.00028717]
Total explained variance: [0.00278106 0.00556204 0.00834232 ... 0.99694144 0.99971283 1.        ]
(Output truncated: all 361 components explain an almost identical share of the variance, about 0.278% each.)
PCA is not useful on the raw board states: every principal component explains a nearly identical share of the variance (about 0.28% each), so no board position stands out as more informative than the others.
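This flat spectrum is expected for the one-stone-per-row encoding. A toy illustration (synthetic data, not the Go boards): when each row activates a single position, no direction dominates, so every component carries a similar share of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Rows that each set exactly one of 20 positions to 1, like the board encoding.
rng = np.random.default_rng(0)
X = np.zeros((500, 20))
X[np.arange(500), rng.integers(0, 20, size=500)] = 1

# PCA on the standardized indicators yields a nearly uniform spectrum.
ratios = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_
```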
PCA for the player info columns
# Note: winner_score and Rank are excluded here
# feature names in merged_games_with_scores_final_new.xlsx
numerical_features = ['handicap', 'step_count',
'Area score', 'Area score of opponent', 'Area_result', 'Territory_score',
'Territory_score_of_opponent', 'Territory_result']
categorical_features = ['game_id', 'move_id', 'color',
'winner_color', 'result', 'rules', 'starter_player',
'level', 'Player', 'Our_players_colour', 'Area_winner_color', 'Territory_winner_color']
# Converting numerical features to numeric (int/float)
for feature in numerical_features:
player_info_columns[feature] = pd.to_numeric(player_info_columns[feature], errors='coerce')
# Handling any NaN values that result from conversion
# For example, filling NaN values with the mean of the column
player_info_columns[numerical_features] = player_info_columns[numerical_features].fillna(
player_info_columns[numerical_features].mean()
)
# Define preprocessor with dense output for OneHotEncoder
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), numerical_features),
('cat', OneHotEncoder(sparse_output=False), categorical_features)
]
)
# Create a pipeline for PCA
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('pca', PCA())
])
# Apply PCA to the player info columns
player_info_pca = pipeline.fit_transform(player_info_columns)
# Extract PCA component and explained variance
pca = pipeline.named_steps['pca']
explained_variance = pca.explained_variance_ratio_
# Get feature names from the preprocessor
onehot_feature_names = pipeline.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(categorical_features)
all_feature_names = numerical_features + list(onehot_feature_names)
# Analyze the PCA components to see the most important features
pca_components = pca.components_
# Print out the most important features for each principal component
print("\nMost important features for each principal component:\n")
for i, component in enumerate(pca_components[:10]):
important_features = sorted(zip(all_feature_names, component), key=lambda x: abs(x[1]), reverse=True)
print(f"Principal Component {i + 1}:")
for feature, loading in important_features[:10]: # Print top 10 features for each component
print(f" {feature}: {loading:.4f}")
print()
Most important features for each principal component:

Principal Component 1: Territory_score_of_opponent: 0.4368, Area score of opponent: 0.4179, Area score: -0.4079, Territory_result: 0.3632, Area_result: 0.3628, Territory_score: -0.3218, level_1: 0.1539, step_count: 0.1083, level_3: -0.0993, winner_color_black: 0.0784
Principal Component 2: handicap: -0.4796, Territory_score: 0.4090, Territory_result: 0.4058, Area_result: 0.4036, Area score: 0.3122, Area score of opponent: -0.1736, starter_player_black: 0.1335, starter_player_white: -0.1335, result_resign: -0.1215, result_score: 0.1215
Principal Component 3: step_count: -0.6448, Area score of opponent: -0.2564, Area_winner_color_B: 0.2239, Area_winner_color_W: -0.2239, Territory_winner_color_B: 0.2060, Territory_winner_color_W: -0.2060, starter_player_black: -0.1937, starter_player_white: 0.1937, rules_Japanese: 0.1803, Player_mangochia: -0.1795
Principal Component 4: handicap: -0.5985, starter_player_black: 0.3295, starter_player_white: -0.3295, step_count: -0.2515, Territory_score: -0.2351, Area score: -0.2326, Territory_result: -0.2272, Area_result: -0.2236, result_resign: 0.1600, result_score: -0.1600
Principal Component 5: step_count: -0.3695, Area_winner_color_W: 0.3110, Area_winner_color_B: -0.3110, winner_color_white: 0.3062, winner_color_black: -0.3062, Territory_winner_color_W: 0.3045, Territory_winner_color_B: -0.3045, Territory_score_of_opponent: 0.2136, Territory_score: 0.1763, Area_result: 0.1619
Principal Component 6: Territory_score: 0.3927, Territory_score_of_opponent: 0.3471, level_2: -0.3379, result_resign: 0.2836, result_score: -0.2836, level_3: 0.2267, Area score of opponent: 0.2035, rules_Chinese: 0.1831, Territory_winner_color_B: 0.1805, Territory_winner_color_W: -0.1805
Principal Component 7: Area score of opponent: -0.3894, Territory_score_of_opponent: -0.3578, Player_mangochia: 0.3023, rules_Chinese: 0.2473, rules_Japanese: -0.2461, Territory_result: 0.2363, Area_result: 0.2332, Territory_score: -0.2262, handicap: 0.2142, result_resign: 0.2113
Principal Component 8: Our_players_colour_B: -0.4948, Our_players_colour_W: 0.4948, level_2: -0.3096, result_resign: -0.2420, result_score: 0.2420, step_count: -0.2403, level_1: 0.2349, winner_color_black: -0.1620, winner_color_white: 0.1620, rules_Japanese: -0.1210
Principal Component 9: color_white: -0.7071, color_black: 0.7071, result_resign: -0.0034, result_score: 0.0034, step_count: -0.0028, Our_players_colour_B: -0.0023, Our_players_colour_W: 0.0023, Player_mangochia: -0.0023, rules_Chinese: -0.0018, rules_Japanese: 0.0015
Principal Component 10: winner_color_white: -0.3732, winner_color_black: 0.3732, Our_players_colour_B: -0.2926, Our_players_colour_W: 0.2926, result_resign: 0.2832, result_score: -0.2832, Area_winner_color_B: -0.1914, Area_winner_color_W: 0.1914, rules_Japanese: 0.1746, Player_mangochia: -0.1572
# Plot the explained variance ratio
plt.figure(figsize=(10, 6))
plt.plot(np.cumsum(explained_variance))
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by Principal Components (Player Info)')
plt.grid(True)
plt.show()
# Plot the same explained variance ratio plot as before but only displaying the first 10 principal components
plt.figure(figsize=(10, 6))
plt.plot(np.cumsum(explained_variance)[:10], marker='o')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by the first 10 Principal Components (Player Info)')
plt.xticks(ticks=range(10), labels=range(1, 10 + 1))
plt.grid(True)
plt.show()
PCA for every feature (both player info columns and board states)
all_data = pd.concat([player_info_columns, board_state_columns], axis=1)
numerical_features.extend(board_state_columns.columns)
all_data.head()
| game_id | move_id | color | winner_color | winner_score | result | rules | handicap | starter_player | step_count | ... | sj | sk | sl | sm | sn | so | sp | sq | sr | ss | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | white | white | Time | score | Japanese | 4 | white | 105 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 2 | black | white | Time | score | Japanese | 4 | white | 105 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 3 | white | white | Time | score | Japanese | 4 | white | 105 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 4 | black | white | Time | score | Japanese | 4 | white | 105 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 5 | white | white | Time | score | Japanese | 4 | white | 105 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 384 columns
# Preprocessing: One-Hot Encoding for categorical features and Standardization for numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(sparse_output=False), categorical_features)  # dense output, as PCA cannot handle sparse input
    ]
)
# Create a pipeline for PCA
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('pca', PCA())
])
# Apply PCA to all features (player info columns + board states)
all_data_pca = pipeline.fit_transform(all_data)
# Extract PCA component and explained variance
pca = pipeline.named_steps['pca']
explained_variance = pca.explained_variance_ratio_
# Get feature names from the preprocessor
# Note: OneHotEncoder generates multiple columns for each category, so we need to extract all feature names
onehot_feature_names = pipeline.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(categorical_features)
all_feature_names = numerical_features + list(onehot_feature_names)
# Analyze the PCA components to see the most important features
pca_components = pca.components_
# Print out the most important features for each principal component
print("\nMost important features for each principal component:\n")
for i, component in enumerate(pca_components[:10]):
# Get the top features for this component, sorted by the absolute value of their loadings
important_features = sorted(zip(all_feature_names, component), key=lambda x: abs(x[1]), reverse=True)
print(f"Principal Component {i + 1}:")
for feature, loading in important_features[:10]: # Print top 10 features for each component
print(f" {feature}: {loading:.4f}")
print()
Most important features for each principal component:

Principal Component 1: Territory_score_of_opponent: 0.4352, Area score of opponent: 0.4163, Area score: -0.4071, Territory_result: 0.3611, Area_result: 0.3607, Territory_score: -0.3214, level_1: 0.1535, step_count: 0.1069, level_3: -0.0992, winner_color_white: -0.0780
Principal Component 2: handicap: -0.4760, Territory_score: 0.3995, Territory_result: 0.3978, Area_result: 0.3957, Area score: 0.3050, Area score of opponent: -0.1660, starter_player_black: 0.1346, starter_player_white: -0.1346, result_resign: -0.1192, result_score: 0.1192
Principal Component 3: step_count: 0.6217, Area score of opponent: 0.2453, Area_winner_color_B: -0.2103, Area_winner_color_W: 0.2103, Territory_winner_color_B: -0.1943, Territory_winner_color_W: 0.1943, starter_player_black: 0.1703, starter_player_white: -0.1703, Player_mangochia: 0.1702, rules_Japanese: -0.1672
Principal Component 4: handicap: -0.5060, starter_player_black: 0.2927, starter_player_white: -0.2927, Territory_result: -0.2093, Area_result: -0.2066, Territory_score: -0.1981, Area score: -0.1891, step_count: -0.1805, result_resign: 0.1403, result_score: -0.1403
Principal Component 5: cs: 0.2743, step_count: 0.2162, Area_winner_color_W: -0.2052, Area_winner_color_B: 0.2052, Territory_winner_color_B: 0.2001, Territory_winner_color_W: -0.2001, winner_color_white: -0.1990, winner_color_black: 0.1990, Territory_score_of_opponent: -0.1468, Territory_score: -0.1278
Principal Component 6: Territory_score: 0.1707, la: -0.1652, level_2: -0.1544, fj: 0.1515, Territory_score_of_opponent: 0.1491, pp: 0.1246, or: 0.1243, result_resign: 0.1221, result_score: -0.1221, ff: -0.1164
Principal Component 7: color_black: -0.1715, color_white: 0.1715, qc: 0.1533, dp: 0.1531, me: 0.1478, ag: 0.1372, mc: 0.1272, ds: 0.1237, lb: 0.1192, ii: 0.1173
Principal Component 8: ns: 0.3316, ss: 0.1855, rs: 0.1640, aa: 0.1541, ms: 0.1257, is: 0.1252, jg: -0.1211, js: 0.1210, gr: 0.1118, qk: 0.1114
Principal Component 9: ag: -0.1653, gs: 0.1578, hs: 0.1562, cs: -0.1557, af: -0.1458, iq: 0.1443, bm: 0.1425, mm: -0.1244, is: 0.1240, oq: 0.1195
Principal Component 10: qk: 0.1761, la: -0.1503, rp: -0.1480, fp: 0.1460, kl: -0.1366, ds: 0.1300, oi: -0.1254, lb: -0.1189, hd: -0.1184, bc: -0.1146
# Plot the explained variance ratio
plt.figure(figsize=(10, 6))
plt.plot(np.cumsum(explained_variance))
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by Principal Components (All data)')
plt.grid(True)
plt.show()
# Plot the same explained variance ratio plot as before but only displaying the first 10 principal components
plt.figure(figsize=(10, 6))
plt.plot(np.cumsum(explained_variance)[:10], marker='o')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by the first 10 Principal Components (All data)')
plt.xticks(ticks=range(10), labels=range(1, 10 + 1))
plt.grid(True)
plt.show()
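A common way to read these cumulative-variance curves is to pick the smallest number of components that reaches a chosen threshold. A minimal sketch on toy data; the 90% threshold is an illustrative choice, not one fixed above:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy correlated data (illustrative); the notebook uses the preprocessed game features
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))

pca_demo = PCA().fit(X_toy)
cumvar = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 90%
n_components_90 = int(np.searchsorted(cumvar, 0.90)) + 1
print(n_components_90, cumvar[n_components_90 - 1])
```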
Brief insights from PCA for all_data (this includes all features)
Principal Component 1: It mainly deals with the differences in territory and area scores, showing how these scores influence the game's outcome and overall control of the board.
Principal Component 2: It focuses on the effect of game handicaps and the final scores, highlighting how initial advantages or disadvantages impact the result.
Principal Component 3: It is all about the number of moves and how the game's progression ties into scoring patterns, with some emphasis on the order of players and rules used.
Principal Component 4: It captures the role of handicaps and which player starts, showing how these aspects shape the game’s balance and scoring.
Principal Component 5: It looks at how certain player attributes and winning conditions influence the outcome, especially the role of player color and related strategies.
player_info_pca
array([[-2.56972002e-01, -1.13613815e+00, 2.21209921e+00, ...,
-1.23202326e-15, 2.98885460e-16, -1.76560590e-17],
[-2.57278066e-01, -1.13578791e+00, 2.20957852e+00, ...,
1.81602177e-16, -2.05224192e-16, 9.55429946e-18],
[-2.56972002e-01, -1.13613815e+00, 2.21209921e+00, ...,
1.66896907e-16, 2.85589067e-16, -1.18001352e-18],
...,
[ 2.90282725e+00, -2.53119759e+00, 1.88180756e+00, ...,
9.63906110e-18, -3.57986474e-18, 3.69702945e-18],
[ 2.90313332e+00, -2.53154785e+00, 1.88432841e+00, ...,
4.10361198e-18, 1.75310587e-18, -2.98190384e-18],
[ 2.90282725e+00, -2.53119759e+00, 1.88180756e+00, ...,
4.81259742e-18, -1.42628565e-18, 1.91088259e-18]])
# pca_index corresponds to the entire index of player_info_columns
pca_index = player_info_columns.index
# Check if lengths match
print("Length of pca_index:", len(pca_index))
print("Length of player_info_pca:", len(player_info_pca))
# Create the pca_df DataFrame using the first 3 principal components
player_info_pca = pipeline.fit_transform(all_data)
pca_df = pd.DataFrame(player_info_pca[:, :3], columns=['PC1', 'PC2', 'PC3'])
# Use pca_index to get the correct 'level' values from player_info_columns
pca_df['level'] = player_info_columns.loc[pca_index, 'level'].values
# JUST SOME CHECKS
print("Checks begin here...")
print(player_info_columns['level'].unique())
print(player_info_columns['level'].value_counts())
print(pca_df['level'].unique())
print("Checks end here.")
# Plotting PC1 vs PC2
plt.figure(figsize=(8, 6))
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='level', palette='viridis')
plt.title('PC1 vs PC2')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Player level', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
# Plotting PC1 vs PC3
plt.figure(figsize=(8, 6))
sns.scatterplot(data=pca_df, x='PC1', y='PC3', hue='level', palette='viridis')
plt.title('PC1 vs PC3')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 3')
plt.legend(title='Player level', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
# Plotting PC2 vs PC3
plt.figure(figsize=(8, 6))
sns.scatterplot(data=pca_df, x='PC2', y='PC3', hue='level', palette='viridis')
plt.title('PC2 vs PC3')
plt.xlabel('Principal Component 2')
plt.ylabel('Principal Component 3')
plt.legend(title='Player level', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
# 3D Plot of PC1, PC2, and PC3
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
sc = ax.scatter(pca_df['PC1'], pca_df['PC2'], pca_df['PC3'], c=pca_df['level'], cmap='viridis')
ax.set_title('3D Plot of PC1, PC2, and PC3')
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
# The scatter has no labeled artists, so plt.legend() would only emit a warning;
# the colorbar already encodes the level
plt.colorbar(sc, label='Player level')
plt.show()
Length of pca_index: 9755
Length of player_info_pca: 9755
Checks begin here...
[1 2 3]
level
2    5677
1    2434
3    1644
Name: count, dtype: int64
[1 2 3]
Checks end here.
fig = px.scatter_3d(pca_df, x='PC1', y='PC2', z='PC3', color='level')
fig.show()
Comments¶
Which features did you use? Why?
- first 3 Principal components
- level of the player (beginner / intermediate / master)
We used the first 3 PC, which used these features:
Most important features for each principal component:
Principal Component 1: Territory_score_of_opponent: 0.4352 Area score of opponent: 0.4163 Area score: -0.4071 Territory_result: 0.3611 Area_result: 0.3607 Territory_score: -0.3214 level_1: 0.1535 step_count: 0.1069 level_3: -0.0992 winner_color_black: 0.0780
Principal Component 2: handicap: 0.4760 Territory_score: -0.3995 Territory_result: -0.3978 Area_result: -0.3957 Area score: -0.3050 Area score of opponent: 0.1660 starter_player_white: 0.1346 starter_player_black: -0.1346 result_resign: 0.1192 result_score: -0.1192
Principal Component 3: step_count: 0.6217 Area score of opponent: 0.2453 Area_winner_color_B: -0.2103 Area_winner_color_W: 0.2103 Territory_winner_color_B: -0.1943 Territory_winner_color_W: 0.1943 starter_player_white: -0.1703 starter_player_black: 0.1703 Player_mangochia: 0.1702 rules_Japanese: -0.1672
Which projection methods did you use? Why?
- We tried PCA, t-SNE, and UMAP; PCA produced the clearest and most insightful plots
- From the cumulative-variance plots we can conclude that the first few principal components already explain a large share of the variance in the data
Why did you choose these hyperparameters?
- For t-SNE, we ran a grid search over the hyperparameters
- For PCA, we used the default settings for the projection and kept the first 3 components; we printed which principal components explain the data best
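The grid search mentioned above can be sketched with sklearn's `ParameterGrid`; the parameter values and the random stand-in data below are illustrative, not the ones actually searched:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.model_selection import ParameterGrid

# Small random stand-in; the notebook runs this on the board-state features
rng = np.random.default_rng(0)
X_small = rng.normal(size=(60, 5))

grid = ParameterGrid({'perplexity': [5, 15, 30],
                      'learning_rate': [100.0, 200.0]})
embeddings = []
for params in grid:
    emb = TSNE(n_components=2, init='pca', random_state=0,
               **params).fit_transform(X_small)
    embeddings.append((params, emb))
print(len(embeddings))  # one embedding per hyperparameter combination
```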
Are there patterns in the global and the local structure?
- players of similar levels follow a similar approach
- the different player levels can be clearly separated from each other
Meta Data Encoding¶
Encode additional features in the visualization.
Use features of the source data and include them in the projection, e.g., by using color, opacity, different shapes, or line styles, etc.
# 1. PCA Implementation
# ---------------------------------
"""pca = PCA(n_components=3) # We are interested in the first 3 components
pca_components = pca.fit_transform(player_info_columns[numerical_features])
# Create a DataFrame for easy handling of PCA components
pca_df = pd.DataFrame(pca_components, columns=['PC1', 'PC2', 'PC3'])"""
# Optionally, add meta-data features for encoding
pca_df['level'] = player_info_columns['level'].values
pca_df['winner_color'] = player_info_columns['winner_color'].values
pca_df['result'] = player_info_columns['result'].values
pca_df['starter_player'] = player_info_columns['starter_player'].values
pca_df['Area_winner_color'] = player_info_columns['Area_winner_color'].values
pca_df['Territory_winner_color'] = player_info_columns['Territory_winner_color'].values
pca_df['color'] = player_info_columns['color'].values
pca_df['Our_players_colour'] = player_info_columns['Our_players_colour'].values
# ---------------------------------
# 2. PCA Plots with Meta-Data Encoding
# ---------------------------------
# PC1 vs PC2
plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='level', style='result', palette='viridis')
plt.title('PC1 vs PC2 with Level and Result Encoding')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Level / result', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()
plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='Area_winner_color', style='Territory_winner_color', palette='viridis')
plt.title('PC1 vs PC2 with Area_winner_color and Territory_winner_color')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Area_winner_color / Territory_winner_color', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()
plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='winner_color', style='Our_players_colour', palette='viridis')
plt.title('PC1 vs PC2 with winner_color and Our player colour')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='winner_color / color', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()
plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='result', style='Our_players_colour', palette='viridis')
plt.title('PC1 vs PC2 with result and Our player colour')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='result / colour', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()
# PC1 vs PC3
plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df, x='PC1', y='PC3', hue='result', style='level', palette='coolwarm')
plt.title('PCA: PC1 vs PC3 with Result and Level Encoding')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 3')
plt.legend(title='Result / Level', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()
# PC2 vs PC3
plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df, x='PC2', y='PC3', hue='winner_color', style='result', palette='Set2')
plt.title('PCA: PC2 vs PC3 with Winner Color and Result Encoding')
plt.xlabel('Principal Component 2')
plt.ylabel('Principal Component 3')
plt.legend(title='Winner Color / Result', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()
def determine_phase(df):
# Group by game_id
df['phase'] = df.groupby('game_id')['move_id'].transform(
lambda x: ["starter" if step <= 20
else "final" if step > (x.max() - 20)
else "intermediate"
for step in x]
)
return df
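On a toy game the phase labeling behaves as expected. A self-contained check (the function is re-defined here so the snippet runs on its own; the 50-move toy game is hypothetical):

```python
import pandas as pd

def determine_phase(df):
    # First 20 moves: "starter"; last 20 moves: "final"; the rest: "intermediate"
    df['phase'] = df.groupby('game_id')['move_id'].transform(
        lambda x: ["starter" if step <= 20
                   else "final" if step > (x.max() - 20)
                   else "intermediate"
                   for step in x]
    )
    return df

toy_games = pd.DataFrame({'game_id': [0] * 50, 'move_id': range(1, 51)})
toy_games = determine_phase(toy_games)
print(toy_games['phase'].value_counts())  # 20 starter, 20 final, 10 intermediate
```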
# Apply the function to the DataFrame
player_info_columns = determine_phase(player_info_columns)
pca_df['phase'] = player_info_columns['phase'].values
pca_df['Player']=player_info_columns['Player'].values
plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='winner_color', style='result', alpha=0.7, palette='viridis')
plt.title('PCA: PC1 vs PC2 with winner_color and result')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='winner_color / result', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()
Here we can see that when the white player wins, the opponent has in most cases resigned. When Black wins, the win is most often decided by score rather than resignation.
(Black plays first unless given a handicap of two or more stones, in which case White plays first.)
plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df.query('Player != "mangochia"'), x='PC1', y='PC2', hue=player_info_columns['step_count'], style='result', alpha=0.7, palette='viridis')
plt.title('PCA: PC1 vs PC2 with number of all steps and result')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='step_count / result', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()
We can see that games with many steps, roughly above 360, usually end without resignation: the players play on until both of them pass.
plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df.query('Player != "mangochia"'), x='PC1', y='PC2', hue=player_info_columns['step_count'], style='level', alpha=0.7, palette='viridis')
plt.title('PCA: PC1 vs PC2 with step_count and level')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='step_count / level', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()
We can see here that master players usually play games with far fewer moves.
Comments¶
Which features did you use? Why?
- level
- result
- Area_winner_color
- Territory_winner_color
- winner_color
- Our_players_colour
- phase (starter, intermediate, final steps)
- step_count
These were the features whose projection plots showed useful insights and clear clusters, from which we could draw the following conclusions (some examples):
- When the white player wins, the opponent has in most cases resigned; when Black wins, the win is most often decided by score rather than resignation.
- Games with many steps (roughly above 360) usually end without resignation; the players play on until both pass.
- Master players usually play games with far fewer steps.
As for the Link states plots: See below the section Link States
How are the features encoded?
- We built a pipeline that separates the features into numerical and categorical ones
- StandardScaler for the numerical features
- One-hot encoding for the categorical features
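Such a preprocessing pipeline can be sketched as below. The column lists and toy rows are hypothetical stand-ins, not the dataset's actual columns:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical column lists; the real notebook derives them from the dataset
num_cols = ['step_count', 'handicap']
cat_cols = ['result', 'winner_color']

preprocess = ColumnTransformer(
    [('num', StandardScaler(), num_cols),
     ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)],
    sparse_threshold=0,  # force a dense matrix so PCA can consume it
)
demo_pipeline = Pipeline([('prep', preprocess), ('pca', PCA(n_components=3))])

toy_df = pd.DataFrame({
    'step_count': [120, 250, 300, 90],
    'handicap': [0, 2, 0, 4],
    'result': ['score', 'resign', 'score', 'resign'],
    'winner_color': ['B', 'W', 'B', 'W'],
})
projected = demo_pipeline.fit_transform(toy_df)
print(projected.shape)  # (4, 3)
```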
Link States¶
Connect the states that belong together.
The states of a single solution should be connected to see the path from the start to the end state. How the points are connected is up to you, for example, with straight lines or splines.
# Apply PCA
pca = PCA(n_components=2)
pca_board_states = pca.fit_transform(board_state_scaled)
# Combine PCA results with game_data
game_data_pca = all_data.copy()
game_data_pca[['PC1', 'PC2']] = pca_board_states
# Determine the phase and add to `game_data_pca`
game_data_pca = determine_phase(game_data_pca)
#Ensure move_id is numeric
game_data_pca['move_id'] = pd.to_numeric(game_data_pca['move_id'], errors='coerce')
# Filter and sort by move_id
filtered_pca_GO_1 = game_data_pca[game_data_pca['game_id'] == 31].sort_values('move_id')
filtered_pca_GO_2 = game_data_pca[game_data_pca['game_id'] == 23].sort_values('move_id')
# Visualization with lines and nodes
lines_1 = alt.Chart(filtered_pca_GO_1).mark_line(
opacity=0.3,
strokeWidth=1.5
).encode(
x='PC1',
y='PC2',
color=alt.Color('phase:N', scale=alt.Scale(domain=['starter', 'intermediate', 'final'], range=['blue', 'green', 'red'])),
order='move_id:Q'
)
nodes_1 = alt.Chart(filtered_pca_GO_1).mark_circle(size=30, opacity=0.6).encode(
x='PC1',
y='PC2',
color=alt.Color('phase:N', scale=alt.Scale(domain=['starter', 'intermediate', 'final'], range=['blue', 'green', 'red']))
)
# Visualization with lines and nodes
lines_2 = alt.Chart(filtered_pca_GO_2).mark_line(
opacity=0.3,
strokeWidth=1.5
).encode(
x='PC1',
y='PC2',
color=alt.Color('phase:N', scale=alt.Scale(domain=['starter', 'intermediate', 'final'], range=['blue', 'green', 'red'])),
order='move_id:Q'
)
nodes_2 = alt.Chart(filtered_pca_GO_2).mark_circle(size=30, opacity=0.6).encode(
x='PC1',
y='PC2',
color=alt.Color('phase:N', scale=alt.Scale(domain=['starter', 'intermediate', 'final'], range=['blue', 'green', 'red']))
)
# Combine lines and nodes
path_chart_with_nodes_1 = (lines_1 + nodes_1).properties(
width=500,
height=500,
title="Paths of Beginner Player"
).interactive()
path_chart_with_nodes_2 = (lines_2 + nodes_2).properties(
width=500,
height=500,
title="Paths of Master Player"
).interactive()
path_chart_with_nodes_1 | path_chart_with_nodes_2
This analysis visualizes the progression of moves for a master player (game_id = 23) and a beginner player (game_id = 31) in a Go game. The data has been processed using Principal Component Analysis (PCA) to reduce the dimensionality of the game state data, enabling a 2D visualization of each player's moves.
Color Coding by Game Phase: The moves are color-coded based on the phase of the game:
- Blue: Opening phase (starter)
- Green: Middle game phase (intermediate)
- Red: Endgame phase (final)
Line and Node Visualization:
- Lines represent the sequential progression of moves, ordered by move_id.
- Nodes represent individual moves at each point in the sequence.
By displaying both visualizations side by side, we can compare the strategic choices and decision-making patterns of the master player versus the beginner player.
We can see from the plot that the master player makes significantly fewer moves.
player_info_columns['Date'] = player_info_columns['Date'].astype(str)
# t-SNE Projection
tsne = TSNE(n_components=2, perplexity=30, n_iter=300)
tsne_result = tsne.fit_transform(board_state_columns)
player_info_columns['tSNE1'] = tsne_result[:, 0]
player_info_columns['tSNE2'] = tsne_result[:, 1]
chart1 = alt.Chart(player_info_columns.query('move_id > 250 and level == 1')).mark_line(
opacity=0.6,
point=alt.MarkConfig(size=50)
).encode(
x='move_id:Q',
y='tSNE2:Q',
color='game_id',
detail='game_id:N',
order='move_id:Q',
tooltip=['game_id', 'move_id', 'winner_color']
).properties(
width=700,
height=400,
title="Move Trajectories by Game - End - Level 1"
).interactive()
chart2 = alt.Chart(player_info_columns.query('move_id > 300 and level == 2')).mark_line(
opacity=0.6,
point=alt.MarkConfig(size=50)
).encode(
x='move_id:Q',
y='tSNE2:Q',
color='game_id',
detail='game_id:N',
order='move_id:Q',
tooltip=['game_id', 'move_id', 'winner_color']
).properties(
width=700,
height=400,
title="Move Trajectories by Game - End - Level 2"
).interactive()
chart3 = alt.Chart(player_info_columns.query('move_id > 300 and level == 3')).mark_line(
opacity=0.6,
point=alt.MarkConfig(size=50)
).encode(
x='move_id:Q',
y='tSNE2:Q',
color='game_id',
detail='game_id:N',
order='move_id:Q',
tooltip=['game_id', 'move_id', 'winner_color']
).properties(
width=700,
height=400,
title="Move Trajectories by Game - End - Level 3"
).interactive()
alt.concat(chart1, chart2, chart3, columns=3).resolve_scale(color='independent')
The plots above show the last steps of some of the games, each plot showing only games from one level. Different colors indicate different game_ids, i.e., different games. Most games end at the same point, at value 0 on the y-axis, which indicates a common end strategy. One of the level-1 games does not follow this pattern; this may be explained by the different game strategies of beginner players.
# PCA
pca_df['move_id'] = player_info_columns['move_id'].values
pca_df['game_id'] = player_info_columns['game_id'].values
player_info_columns['Date'] = player_info_columns['Date'].astype(str)
alt.Chart(pca_df.query('move_id < 10')).mark_line(
opacity=0.6,
point=alt.MarkConfig(size=50)
).encode(
x='move_id:Q',
y='PC1:Q',
color='level:N',
detail='game_id:N',
order='move_id:Q',
tooltip=['game_id', 'move_id', 'winner_color']
).properties(
width=700,
height=400,
title="Move Trajectories by Level using PCA"
).interactive()
The graph above shows the first ten moves of the investigated Go games, with colors indicating the level of each game. The areas of the plot can be approximately divided according to the three levels, which suggests that players of each level follow similar approaches.
alt.Chart(player_info_columns).mark_rect().encode(
x=alt.X('game_id:O', title='Game ID', axis=alt.Axis(labelAngle=-45)),
y=alt.Y('level:O', title='Level'),
color=alt.Color('count(move_id):Q', scale=alt.Scale(scheme='viridis')),
tooltip=['game_id', 'level', 'count(move_id)']
).properties(
width=700,
height=400,
title="Heatmap of Move Counts by Game ID and Level"
).interactive()
The heatmap above shows the number of moves per game, sorted by level and game. Except for individual outliers, games of levels one and three had fewer moves, i.e., they were shorter than games of the intermediate level. A possible explanation: beginners lack good strategies and therefore lose quickly; experts have good winning strategies and therefore win quickly; intermediate players have strategies to avoid losing immediately, but they also do not win immediately.
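The aggregate behind this heatmap can be reproduced with a plain `groupby`; the rows below are hypothetical stand-ins for the per-move table:

```python
import pandas as pd

# Hypothetical per-move rows (one row per move, as in player_info_columns)
moves = pd.DataFrame({
    'game_id': [1, 1, 1, 2, 2, 3],
    'level':   [1, 1, 1, 2, 2, 3],
    'move_id': [1, 2, 3, 1, 2, 1],
})

# Count moves per (game, level) pair -- the quantity the heatmap color encodes
move_counts = (moves.groupby(['game_id', 'level'])['move_id']
                    .count()
                    .rename('move_count')
                    .reset_index())
print(move_counts)
```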
Optional¶
Projection Space Explorer (click to reveal)
Projection Space Explorer
The Projection Space Explorer is a web application to plot and connect two dimensional points. Metadata of the data points can be used to encode additonal information into the projection, e.g., by using different shapes or colors.
Further Information:
- Paper: https://jku-vds-lab.at/publications/2020_tiis_pathexplorer/
- Repo: https://github.com/jku-vds-lab/projection-space-explorer/
- Application Overview: https://jku-vds-lab.at/pse/
Data Format
How to format the data can be found in the Projection Space Explorer's README. Example data with three lines, with two colors (algo) and an additional mark encoding (cp):
| x | y | line | cp | algo |
|---|---|---|---|---|
| 0.0 | 0 | 0 | start | 1 |
| 2.0 | 1 | 0 | state | 1 |
| 4.0 | 4 | 0 | state | 1 |
| 6.0 | 1 | 0 | state | 1 |
| 8.0 | 0 | 0 | state | 1 |
| 12.0 | 0 | 0 | end | 1 |
| -1.0 | 10 | 1 | start | 2 |
| 0.5 | 5 | 1 | state | 2 |
| 2.0 | 3 | 1 | state | 2 |
| 3.5 | 0 | 1 | state | 2 |
| 5.0 | 3 | 1 | state | 2 |
| 6.5 | 5 | 1 | state | 2 |
| 8.0 | 10 | 1 | end | 2 |
| 3.0 | 6 | 2 | start | 2 |
| 2.0 | 7 | 2 | end | 2 |
Save the dataset to CSV, e.g. using pandas: df.to_csv('data_path_explorer.csv', encoding='utf-8', index=False)
and upload it in the Projection Space Explorer by clicking on OPEN FILE in the top left corner.
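A sketch that reconstructs the README example table above and writes it to the CSV file expected by the Projection Space Explorer:

```python
import pandas as pd

# The example table above, built programmatically
rows = [
    (0.0, 0, 0, 'start', 1), (2.0, 1, 0, 'state', 1), (4.0, 4, 0, 'state', 1),
    (6.0, 1, 0, 'state', 1), (8.0, 0, 0, 'state', 1), (12.0, 0, 0, 'end', 1),
    (-1.0, 10, 1, 'start', 2), (0.5, 5, 1, 'state', 2), (2.0, 3, 1, 'state', 2),
    (3.5, 0, 1, 'state', 2), (5.0, 3, 1, 'state', 2), (6.5, 5, 1, 'state', 2),
    (8.0, 10, 1, 'end', 2),
    (3.0, 6, 2, 'start', 2), (2.0, 7, 2, 'end', 2),
]
pse_df = pd.DataFrame(rows, columns=['x', 'y', 'line', 'cp', 'algo'])
pse_df.to_csv('data_path_explorer.csv', encoding='utf-8', index=False)
print(pse_df.shape)  # (15, 5)
```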
ℹ You can also include your high-dimensional data and use it to adapt the visualization.
Results¶
You may add additional screenshots of the Projection Space Explorer.
Interpretation¶
What can be seen in the projection(s)?
- Players of the same level have similar approaches
- Players of different levels have different approaches (master player has different approach than beginner player)
- On average, master games are shorter (contain fewer steps) than beginner games
- Beginner games tend to include more abrupt phase transitions (for example, they tend to end more abruptly)
- Ending states are not very diverse (most games end at the same point)
- There are some games where different players won depending on Area vs. Territory scoring
- Most master-level games tend to end with a resignation; this is also a tendency for beginner games, but not necessarily for intermediate games
Was it what you expected? If not what did you expect?
Mostly yes, but there were some exceptions:
- Ending states are not so diverse → we expected to identify more diverse ending states
- For both master and beginner games it is a tendency to end the game with the opponent resigning → we only expected to see this for master games
Can you confirm prior hypotheses from the projection?
Yes, for example these ones:
- Different approaches by the expertise level of the players: master players display more sophisticated strategies
- On average, master games are shorter (contain fewer steps) than beginner games
Did you get any unexpected insights?
- Players of the same level have similar approaches
- Ending states are not so diverse (Most games end at the same point)
- Most master-level games tend to end with a resignation; this is also a tendency for beginner games, but not necessarily for intermediate games
Submission¶
When you’ve finished working on this assignment, please download this notebook as HTML and add it to your repository in addition to the notebook file.